84        Bioinformatics

cd ref

wget https://hgdownload.soe.ucsc.edu/goldenPath/hg38/bigZips/hg38.

fa.gz

gunzip -d hg38.fa.gz

samtools faidx hg38.fa

bowtie2-build --thread 4 hg38.fa hg

cd ..

mkdir sam

bowtie2 -x \

ref/hg \

-1 data/SRR769545_1.fastq.gz \

-2 data/SRR769545_2.fastq.gz \

-S sam/SRR769545.sam

This time we have chosen to download the UCSC human reference genomes because the

chromosome names are given instead of accession numbers.

Once the SAM file has been created, we can convert it into a BAM file, sort it, and index

it using Samtools.

samtools view -uS -o sam/SRR769545.bam sam/SRR769545.sam

samtools sort -@ 4 -T bowtie2.tmp.sort \

-o sam/SRR769545_sorted.bam sam/SRR769545.bam

samtools index sam/SRR769545_sorted.bam

Then, we can use bcftools to detect the variants like variant substitutions and write that

information into a binary variant call format (BCF), which is the binary form of vari-

ant call format (VCF). This file includes the difference in genotype between the reference

genome and the genome of the individual sequenced. The variant discovery and VCF file

format will be discussed in detail in Chapter 4. However, they have to be used here to show

how the reference-guided genome assembly is performed. The bcftools can be installed in

Linux as follows:

sudo apt install bcftools

Now, we can use the “bcftools mpileup” command to collapse the pileup of the reads

aligned to the reference genomes in the BAM file and then to write only the variant infor-

mation into a VCF file. In the following, the variant information is written into a BCF file

and then that binary file is converted into a VCF file:

bcftools mpileup -f ref/hg38.fa \

sam/SRR769545_sorted.bam \

| bcftools call -mv -Ob \

-o sam/SRR769545.bcf

bcftools convert -O v \

-o SRR769545.vcf \

SRR769545.bcf